Abstract

Variational principles for the rate distortion (RD) theory in lossy compression are
formulated within the ambit of the generalized nonextensive statistics of Tsallis, for
values of the nonextensivity parameter satisfying 0 < q < 1 and q > 1. Alternating
minimization numerical schemes to evaluate the nonextensive RD function are
derived. Numerical simulations demonstrate the efficacy of generalized statistics RD
models.
1 Introduction

The generalized (nonadditive) statistics of Tsallis [1,2] has recently been the
focus of much attention in statistical physics and allied disciplines. Nonadditive
statistics, which generalizes the extensive Boltzmann-Gibbs-Shannon statistics,
has much utility in a wide spectrum of disciplines ranging from complex systems
and condensed matter physics to financial mathematics. This paper investigates
nonadditive statistics within the context of Rate Distortion (RD) theory in lossy
data compression.

RD theory constitutes one of the cornerstones of contemporary information 
theory [3, 4], and is a prominent example of source coding. It addresses the 
problem of determining the minimal amount of entropy (or information) R 
that should be communicated over a channel, so that a compressed (reduced) 
representation of the source (input signal) can be approximately reconstructed 
at the receiver (output signal) without exceeding a given distortion D. 

For a thorough exposition of RD theory, Section 13 of [3] should be consulted.
Consider a discrete random variable $X \in \mathcal{X}$ (footnote 3), called the source
alphabet or the codebook, and another discrete random variable
$\tilde{X} \in \tilde{\mathcal{X}}$ which is a compressed representation of $X$. The
compressed representation $\tilde{X}$ is sometimes referred to as the reproduction
alphabet or the quantized codebook. By definition, quantization is the process of
approximating a continuous range of values (or a very large set of possible discrete
values) by a relatively small set of discrete symbols or integer values.

The mapping of $x \in \mathcal{X}$ to $\tilde{x} \in \tilde{\mathcal{X}}$ is characterized
by a conditional (transition) probability $p(\tilde{x}|x)$. The information rate
distortion function is obtained by minimizing the generalized mutual entropy
$I_q(X;\tilde{X})$ (defined in Section 2; see footnote 4) over all normalized
$p(\tilde{x}|x)$. Note that in RD theory $I_q(X;\tilde{X})$ is known as the compression
information (see Section 4). Here, q is the nonextensivity parameter [1, 2] defined
in Section 2.

RD theory has found applications in diverse disciplines, which include data
compression and machine learning. Deterministic annealing [5,6] and the information
bottleneck method [7] are two influential paradigms in machine learning that are
closely related to RD theory. The representation of RD theory in the form of a
variational principle, expressed within the framework of the Shannon information
theory, has been established [3]. The computational implementation of the RD
problem is achieved by application of the Blahut-Arimoto alternating minimization
algorithm [3,8], derived from the celebrated Csiszár-Tusnády theory [9].

Since the work on nonextensive source coding by Landsberg and Vedral [10], a
number of studies on the information theoretic aspects of generalized statistics
pertinent to coding related problems have been performed by Yamano [11],
Furuichi [12, 13], and Suyari [14], amongst others. The source coding theorem,
central to the RD problem, has been derived by Yamano [15] using generalized
statistics. A preliminary work by Venkatesan [16] has investigated the
reformulation of RD theory and the information bottleneck method within the
framework of nonextensive statistics.

3 Calligraphic fonts denote sets.

4 The absence of a principled nonextensive channel coding theorem prompts the
use of the term mutual entropy instead of mutual information.

Generalized statistics has utilized a number of constraints to define expectation
values. The linear constraints originally employed by Tsallis, of the form
$\langle A \rangle = \sum_i p_i A_i$ [1], were convenient owing to their similarity to
the maximum entropy constraints. The linear constraints were abandoned because of
difficulties encountered in obtaining an acceptable form for the partition function.
These were subsequently replaced by the Curado-Tsallis (C-T) [17] constraints
$\langle A \rangle_q = \sum_i p_i^q A_i$. The C-T constraints were later discarded on
physics related grounds, $\langle 1 \rangle_q \neq 1$, and replaced by the normalized
Tsallis-Mendes-Plastino (T-M-P) constraints [18]
$\langle\langle A \rangle\rangle_q = \sum_i \frac{p_i^q}{\sum_i p_i^q} A_i$. The
dependence of the expectation value on the normalized pdf renders the canonical
probability distributions obtained using the T-M-P constraints self-referential.
A fourth form of constraint, prominent in nonextensive statistics, is the optimal
Lagrange multiplier (OLM) constraint [19, 20]. The OLM constraint removes the
self-referentiality caused by the T-M-P constraints by introducing centered mean
values.

A recent formulation by Ferri, Martinez, and Plastino [21] has demonstrated a
methodology to "rescue" the linear constraints in maximum (Tsallis) entropy
models, and has related solutions obtained using the linear, C-T, and T-M-P
constraints. This formulation [21] has commonality with the studies of Wada
and Scarfone [22], Bashkirov [23], and Di Sisto et al. [24]. This paper extends
the work in [16] by employing the Ferri-Martinez-Plastino formulation [21] to
formulate self-consistent nonextensive RD models for 0 < q < 1 and q > 1.

Tsallis statistics is described by two separate ranges of the nonextensivity
parameter, i.e. 0 < q < 1 and q > 1. Within the context of coding theory and
learning theory, each range of q has its own specific utility. Un-normalized
Tsallis entropies take different forms for 0 < q < 1 and q > 1, respectively.
For example, as defined in Section 2, for 0 < q < 1, the generalized mutual
entropy is of the form
$I_{0<q<1}(X;\tilde{X}) = -\sum_{x,\tilde{x}} p(x,\tilde{x}) \ln_q\left(\frac{p(x)p(\tilde{x})}{p(x,\tilde{x})}\right)$.

For q > 1, as described in Section 2, the generalized mutual entropy is defined by
$I_{q>1}(X;\tilde{X}) = S_q(X) + S_q(\tilde{X}) - S_q(X,\tilde{X})$, where $S_q(X)$ and
$S_q(\tilde{X})$ are the marginal Tsallis entropies for the random variables $X$ and
$\tilde{X}$, and $S_q(X,\tilde{X})$ is the joint Tsallis entropy. Unlike the
Boltzmann-Gibbs-Shannon case, $I_{0<q<1}(X;\tilde{X})$ can never acquire the form of
$I_{q>1}(X;\tilde{X})$, and vice versa. While the form of $I_{0<q<1}(X;\tilde{X})$ is
important in a number of applications of practical interest in coding theory and
learning theory, un-normalized Tsallis entropies for q > 1 demonstrate a number of
important properties such as the generalized data processing inequality and the
generalized Fano inequality [12].






It may be noted that normalized Tsallis entropies do exhibit the generalized 
data processing inequality and the generalized Fano inequality [11]. As pointed 
out by Abe [25], normalized Tsallis entropies do not possess Lesche stability. 
However, for applications in communications theory and learning theory, the 
local stability criterion of Yamano [26] may be evoked to justify the use of 
normalized Tsallis entropies described in terms of escort probabilities. Ongo- 
ing studies, which will be reported elsewhere, have established the relation 
between the solutions of generalized RD theory for un-normalized Tsallis en- 
tropies using linear constraints that are reported in this paper, and, normalized 
Tsallis entropies using T-M-P constraints defined in terms of escort probabili- 
ties, in a manner similar to that employed by Wada and Scarfone [27]. 

To reconcile the different forms of the generalized mutual entropy for 0 < q < 1
and q > 1, the additive duality of nonextensive statistics [28] is invoked in
Section 3. This yields dual Tsallis entropies, characterized by the
re-parameterization $q^* = 2 - q$ of the nonextensivity parameter, and results in a
dual generalized RD theory. An important feature of dual Tsallis entropies is the
similarity of their forms to their counterparts in Boltzmann-Gibbs-Shannon
statistics, the difference being $\log(\cdot) \rightarrow \ln_{q^*}(\cdot)$ [27].

In this paper, Tsallis entropies characterized by a nonextensivity parameter q
are called q-Tsallis entropies. Similarly, those characterized by the
re-parameterized nonextensivity parameter q* are called q*-Tsallis entropies. The
two forms of Tsallis entropies may be used in conjunction to obtain a
self-consistent description of nonextensive phenomena [29].

The remainder of this paper is organized as follows. The basic theory of q-Tsallis
entropies and q*-Tsallis entropies is described in Sections 2 and 3, respectively.
Section 3 also derives select information theoretic properties for q*-Tsallis
entropies. Section 4 defines the generalized statistics RD problem, and describes
alternating minimization numerical algorithms within the ambit of nonextensive
statistics. The mathematical basis underlying nonextensive alternating minimization
algorithms (rate distortion), and subsequently, alternating maximization algorithms
(channel capacity), is also derived in Section 4. This is accomplished in Lemma 1 of
this paper, by extending the positivity conditions in Lemma 13.8.1 in [3] to the case
of Tsallis statistics. Section 5 extends prior studies [16] by deriving variational
principles for both a generalized RD theory and a dual generalized RD theory.
The practical implementation of a nonextensive alternating minimization algorithm
is also described in Section 5.

Section 6 presents numerical simulations that demonstrate the efficacy of the
generalized RD theory vis-a-vis equivalent formulations derived within the
Boltzmann-Gibbs-Shannon framework. It is demonstrated that the generalized RD
theory possesses a lower threshold for the compression information, as compared
with equivalent extensive Boltzmann-Gibbs-Shannon RD models. This feature has
immense potential significance in data compression applications. Section 7
concludes this paper by summarizing salient results and briefly highlighting
qualitative extensions that will be presented in forthcoming publications.



2 Tsallis entropies 



The un-normalized Tsallis entropy is defined in terms of discrete variables as [1, 2]

$$S_q(X) = -\frac{1 - \sum_x p(x)^q}{1-q}; \quad \sum_x p(x) = 1. \tag{1}$$

The constant q is referred to as the nonextensivity parameter. Given two
independent variables X and Y, one of the fundamental consequences of
nonextensivity is demonstrated by the pseudo-additivity relation

$$S_q(XY) = S_q(X) + S_q(Y) + (1-q)\, S_q(X)\, S_q(Y). \tag{2}$$
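The pseudo-additivity relation (2) and the Shannon limit of (1) can be checked numerically. The following is a minimal sketch in plain Python; the function name `tsallis_entropy` and the two example distributions are illustrative assumptions, not part of the original formulation.

```python
import math

def tsallis_entropy(p, q):
    """Un-normalized Tsallis entropy (1) of a discrete distribution p."""
    if abs(q - 1.0) < 1e-12:
        # Shannon entropy, i.e. the q -> 1 limit of (1)
        return -sum(pi * math.log(pi) for pi in p if pi > 0)
    return (1.0 - sum(pi ** q for pi in p if pi > 0)) / (q - 1.0)

# Two independent variables: the joint distribution is the product p(x) p(y).
px = [0.5, 0.3, 0.2]   # illustrative distribution of X
py = [0.6, 0.4]        # illustrative distribution of Y
pxy = [a * b for a in px for b in py]

q = 0.7
lhs = tsallis_entropy(pxy, q)
rhs = (tsallis_entropy(px, q) + tsallis_entropy(py, q)
       + (1.0 - q) * tsallis_entropy(px, q) * tsallis_entropy(py, q))
print(lhs, rhs)  # both sides of (2) agree up to rounding

# Values of q close to 1 approach the Shannon entropy.
print(tsallis_entropy(px, 1.0 + 1e-6), tsallis_entropy(px, 1.0))
```

The extra term (1-q) S_q(X) S_q(Y) is what distinguishes the nonextensive case; it vanishes at q = 1.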



Here, (1) and (2) imply that extensive statistics is recovered as $q \to 1$. Taking
the limit $q \to 1$ in (1) and invoking l'Hôpital's rule, $S_q(X) \to S(X)$, i.e.,
the Shannon entropy. The generalized Kullback-Leibler divergence (K-Ld) is
of the form [30, 31]

$$D^q_{K-L}\left[p(X) \| r(X)\right] = \sum_x p(x) \frac{\left(\frac{p(x)}{r(x)}\right)^{q-1} - 1}{q-1}. \tag{3}$$

Akin to the Tsallis entropy, the generalized K-Ld obeys the pseudo-additivity
relation [31]. Nonextensive statistics is intimately related to q-deformed algebra
and calculus (see [32] and the references within). The q-deformed logarithm
and exponential are defined as [32]

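$$\ln_q(x) = \frac{x^{1-q} - 1}{1-q},$$

and,

$$\exp_q(x) = \begin{cases} \left[1 + (1-q)x\right]^{\frac{1}{1-q}}; & 1 + (1-q)x > 0 \\ 0; & \text{otherwise}, \end{cases} \tag{4}$$

respectively. Before proceeding further, three important relations from q-deformed
algebra, employed in this paper, are stated [12, 32]

$$\begin{aligned}
\ln_q\left(\frac{x}{y}\right) &= y^{q-1}\left(\ln_q x - \ln_q y\right), \\
\ln_q(xy) &= \ln_q x + x^{1-q} \ln_q y, \\
\text{and,} \quad \ln_q\left(\frac{1}{x}\right) &= -x^{q-1} \ln_q x.
\end{aligned} \tag{5}$$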
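The definitions (4) and the three relations (5) are straightforward to verify numerically. The sketch below is illustrative only: the function names `ln_q` and `exp_q` and the test values of q, x, and y are assumptions chosen for the example.

```python
import math

def ln_q(x, q):
    """q-deformed logarithm (4); reduces to log(x) at q = 1."""
    if abs(q - 1.0) < 1e-12:
        return math.log(x)
    return (x ** (1.0 - q) - 1.0) / (1.0 - q)

def exp_q(x, q):
    """q-deformed exponential (4); zero outside the admissible domain."""
    if abs(q - 1.0) < 1e-12:
        return math.exp(x)
    base = 1.0 + (1.0 - q) * x
    return base ** (1.0 / (1.0 - q)) if base > 0 else 0.0

q, x, y = 0.6, 2.5, 0.8  # illustrative values

# exp_q inverts ln_q for positive arguments.
roundtrip = exp_q(ln_q(x, q), q)

# Each pair holds (left-hand side, right-hand side) of one relation in (5).
rel1 = (ln_q(x / y, q), y ** (q - 1.0) * (ln_q(x, q) - ln_q(y, q)))
rel2 = (ln_q(x * y, q), ln_q(x, q) + x ** (1.0 - q) * ln_q(y, q))
rel3 = (ln_q(1.0 / x, q), -(x ** (q - 1.0)) * ln_q(x, q))

for name, (lhs, rhs) in (("quotient", rel1), ("product", rel2), ("inverse", rel3)):
    print(name, lhs, rhs)
```

Unlike the ordinary logarithm, the product rule carries the weight x^{1-q}; this factor is the source of the non-additive terms appearing throughout the paper.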



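The un-normalized Tsallis entropy (1), conditional Tsallis entropy, joint Tsallis
entropy, and the generalized K-Ld (3) may thus be written as [12, 30, 31]

$$\begin{aligned}
S_q(X) &= -\sum_x p(x)^q \ln_q p(x), \\
S_q(\tilde{X}|X) &= -\sum_x \sum_{\tilde{x}} p(x,\tilde{x})^q \ln_q p(\tilde{x}|x), \\
S_q(X,\tilde{X}) &= -\sum_x \sum_{\tilde{x}} p(x,\tilde{x})^q \ln_q p(x,\tilde{x}) \\
&= S_q(X) + S_q(\tilde{X}|X) = S_q(\tilde{X}) + S_q(X|\tilde{X}), \\
\text{and,} \quad D^q_{K-L}\left[p(X) \| r(X)\right] &= -\sum_x p(x) \ln_q\left(\frac{r(x)}{p(x)}\right),
\end{aligned} \tag{6}$$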



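respectively. The joint convexity of the generalized K-Ld for q > 0 is established
by the relation [33, 34]

$$D^q_{K-L}\left[\sum_a \eta_a p_a \Big\| \sum_a \eta_a r_a\right] \leq \sum_a \eta_a D^q_{K-L}\left[p_a \| r_a\right], \quad \eta_a > 0, \text{ and, } \sum_a \eta_a = 1. \tag{7}$$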
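The q-logarithmic forms in (6) and the convexity relation (7) can be spot-checked on small examples. The sketch below is a numerical illustration under stated assumptions: the joint distribution, the random test distributions, the mixing weight `eta`, and all function names are hypothetical choices, and a single random trial illustrates (7) rather than proving it.

```python
import random

def ln_q(x, q):
    """q-deformed logarithm (4), for q != 1."""
    return (x ** (1.0 - q) - 1.0) / (1.0 - q)

def kld_q(p, r, q):
    """Generalized K-Ld in the power form (3)."""
    return sum(pi * ((pi / ri) ** (q - 1.0) - 1.0) / (q - 1.0)
               for pi, ri in zip(p, r))

def kld_q_log(p, r, q):
    """Generalized K-Ld in the q-logarithm form (6)."""
    return -sum(pi * ln_q(ri / pi, q) for pi, ri in zip(p, r))

def random_dist(n, rng):
    """Strictly positive random distribution (illustrative helper)."""
    w = [rng.random() + 0.05 for _ in range(n)]
    total = sum(w)
    return [wi / total for wi in w]

# Hypothetical joint distribution p(x, x~): rows index x, columns index x~.
joint = [[0.20, 0.10], [0.05, 0.25], [0.30, 0.10]]
q = 1.5
px = [sum(row) for row in joint]

# Entropies in the q-logarithm form (6), and the chain rule gap.
s_x = -sum(p ** q * ln_q(p, q) for p in px)
s_joint = -sum(p ** q * ln_q(p, q) for row in joint for p in row)
s_cond = -sum(pij ** q * ln_q(pij / px[i], q)
              for i, row in enumerate(joint) for pij in row)
chain_gap = s_joint - (s_x + s_cond)

# One random instance of the joint convexity relation (7).
rng = random.Random(0)
p1, r1, p2, r2 = (random_dist(4, rng) for _ in range(4))
eta = 0.3
p_mix = [eta * a + (1.0 - eta) * b for a, b in zip(p1, p2)]
r_mix = [eta * a + (1.0 - eta) * b for a, b in zip(r1, r2)]
convexity_gap = (eta * kld_q(p1, r1, q) + (1.0 - eta) * kld_q(p2, r2, q)
                 - kld_q(p_mix, r_mix, q))

print(chain_gap)                                 # zero up to rounding
print(kld_q(p1, r1, q), kld_q_log(p1, r1, q))    # the forms (3) and (6) agree
print(convexity_gap)                             # non-negative, as in (7)
```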



In the framework of Boltzmann-Gibbs-Shannon statistics, the mutual information
may be expressed as [3]
$I(X;\tilde{X}) = S(X) - S(X|\tilde{X}) = S(\tilde{X}) - S(\tilde{X}|X) = I(\tilde{X};X)$.
This is the manifestation of the symmetry of the mutual information within the
Boltzmann-Gibbs-Shannon model. Within the framework of nonextensive statistics,
the inequalities (sub-additivities)

$$S_q(X|\tilde{X}) \leq S_q(X), \text{ and, } S_q(\tilde{X}|X) \leq S_q(\tilde{X}), \tag{8}$$

do not generally hold true for 0 < q < 1. Note that (8) is only valid for q > 1,
as noted by Furuichi [12]. The sub-additivities (8) are required to hold true
for the generalized mutual entropy to be defined by

$$\begin{aligned}
I_q(X;\tilde{X}) &= S_q(X) - S_q(X|\tilde{X}) = S_q(\tilde{X}) - S_q(\tilde{X}|X) \\
&= S_q(X) + S_q(\tilde{X}) - S_q(X,\tilde{X}) = I_q(\tilde{X};X); \quad q > 1.
\end{aligned} \tag{9}$$



Note that Furuichi [12] has presented a thorough and exhaustive qualitative
extension of the analysis of Daroczy [35], and has proven that (9) holds true
for un-normalized Tsallis entropies for q > 1. Further, for normalized Tsallis
entropies, Yamano [11] has elegantly established the symmetry described by
(9) in the range 0 < q < 1.

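The un-normalized generalized mutual entropy $I_q(X;\tilde{X})$, in the range
0 < q < 1, is defined by the generalized K-Ld between the joint probability
$p(X,\tilde{X})$ and the marginal probabilities $p(X)$ and $p(\tilde{X})$,
respectively (footnote 5)

$$\begin{aligned}
I_q(X;\tilde{X}) &= -\sum_x \sum_{\tilde{x}} p(x,\tilde{x}) \ln_q\left(\frac{p(x)p(\tilde{x})}{p(x,\tilde{x})}\right) \\
&= D^q_{K-L}\left[p(X,\tilde{X}) \| p(X)p(\tilde{X})\right] \\
&= \left\langle D^q_{K-L}\left[p(\tilde{X}|X) \| p(\tilde{X})\right] \right\rangle_{p(x)}; \quad 0 < q < 1.
\end{aligned} \tag{10}$$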



The sub-additivities (8) do not generally hold true in the range 0 < q < 1.
This forecloses the prospect of transparently establishing the symmetry of the
generalized mutual entropy in the range 0 < q < 1, in the manner akin to (9)
and the Boltzmann-Gibbs-Shannon model.

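From (10), the symmetry of the generalized mutual entropies $I_q(X;\tilde{X})$ and
$I_q(\tilde{X};X)$ may be summarized as

$$\begin{aligned}
I_q(X;\tilde{X}) &= D^q_{K-L}\left[p(X,\tilde{X}) \| p(X)p(\tilde{X})\right] \\
&= \left\langle D^q_{K-L}\left[p(\tilde{X}|X) \| p(\tilde{X})\right] \right\rangle_{p(x)} \\
&= \left\langle D^q_{K-L}\left[p(X|\tilde{X}) \| p(X)\right] \right\rangle_{p(\tilde{x})} \\
&= D^q_{K-L}\left[p(\tilde{X},X) \| p(\tilde{X})p(X)\right] = I_q(\tilde{X};X).
\end{aligned} \tag{11}$$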



It is important to note that the generalized K-Ld (6),
$D^q_{K-L}\left[p(X) \| r(X)\right]$, is not symmetric. However, as described in (11),
if $X$ and $\tilde{X}$ are discrete random variables with marginals $p(X)$ and
$p(\tilde{X})$ and joint distribution $p(X,\tilde{X})$, then the generalized mutual
entropy for 0 < q < 1 defined in terms of the generalized K-Ld is symmetric. This is
a consequence of the Kolmogorov theorem, which states that joint distributions are
always invariant to the ordering of the random variables [36]. Specifically, the
generalized K-Ld between $p(X,\tilde{X})$ and $p(X)p(\tilde{X})$ is identical to the
generalized K-Ld between $p(\tilde{X},X)$ and $p(\tilde{X})p(X)$. The symmetry between
the two distinct forms of generalized mutual entropy for 0 < q < 1 and q* > 1, in a
manner analogous to (9) and the Boltzmann-Gibbs-Shannon case, may be proven with the
aid of the additive duality. The proof for this symmetry relation is identical to the
derivation of Theorem 3 in Section 3 of this paper, and may be obtained on
interchanging the nonextensivity parameters q and $q^* = 2 - q$, wherever they occur
in (19).

5 The $\log(\cdot)$ in the extensive convex mutual information [3] is replaced by
$\ln_q(\cdot)$ (4). Also refer to Theorem 3 in this paper.
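The symmetry of the generalized mutual entropy for 0 < q < 1, and its equivalent expression as an averaged conditional K-Ld, can be illustrated on a small joint distribution. This is a minimal sketch; the joint distribution, the value of q, and the function names are assumptions made for the example.

```python
def ln_q(x, q):
    """q-deformed logarithm (4), for q != 1."""
    return (x ** (1.0 - q) - 1.0) / (1.0 - q)

def kld_q(p, r, q):
    """Generalized K-Ld in the q-logarithm form (6)."""
    return -sum(pi * ln_q(ri / pi, q) for pi, ri in zip(p, r))

def mutual_entropy_q(joint, q):
    """Generalized mutual entropy (10) of a strictly positive joint distribution."""
    px = [sum(row) for row in joint]        # marginal p(x)
    pt = [sum(col) for col in zip(*joint)]  # marginal p(x~)
    return -sum(pij * ln_q(px[i] * pt[j] / pij, q)
                for i, row in enumerate(joint)
                for j, pij in enumerate(row))

# Hypothetical joint distribution p(x, x~): rows index x, columns index x~.
joint = [[0.20, 0.10], [0.05, 0.25], [0.30, 0.10]]
q = 0.7

i_q = mutual_entropy_q(joint, q)
# Swapping the roles of X and X~ amounts to transposing the joint table.
i_q_swapped = mutual_entropy_q([list(col) for col in zip(*joint)], q)

# Averaged conditional form: < D[p(x~|x) || p(x~)] >_{p(x)}.
px = [sum(row) for row in joint]
pt = [sum(col) for col in zip(*joint)]
i_q_avg = sum(px[i] * kld_q([pij / px[i] for pij in row], pt, q)
              for i, row in enumerate(joint))

print(i_q, i_q_swapped, i_q_avg)  # all three values coincide up to rounding
```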



3 Dual Tsallis information theoretic measures 



This paper makes prominent use of the additive duality in nonextensive statistics.
Setting $q^* = 2 - q$, from (4) the dual deformed logarithm and exponential
are defined as

$$\ln_{q^*}(x) = -\ln_q\left(\frac{1}{x}\right), \text{ and, } \exp_{q^*}(x) = \frac{1}{\exp_q(-x)}. \tag{12}$$

The reader is referred to Naudts [29] for further details.
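The dual relations (12) follow directly from the definitions (4) and can be confirmed numerically. In the sketch below, the function names and the values of q and x are illustrative assumptions; x is chosen so that both deformed exponentials lie in their admissible domain.

```python
def ln_q(x, q):
    """q-deformed logarithm (4), for q != 1."""
    return (x ** (1.0 - q) - 1.0) / (1.0 - q)

def exp_q(x, q):
    """q-deformed exponential (4), for q != 1; zero outside the admissible domain."""
    base = 1.0 + (1.0 - q) * x
    return base ** (1.0 / (1.0 - q)) if base > 0 else 0.0

q = 0.4
q_star = 2.0 - q   # additive duality
x = 0.5            # illustrative argument

dual_log = (ln_q(x, q_star), -ln_q(1.0 / x, q))
dual_exp = (exp_q(x, q_star), 1.0 / exp_q(-x, q))

print(dual_log)  # both entries agree up to rounding
print(dual_exp)  # both entries agree up to rounding
```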
A dual Tsallis entropy defined by

$$S_{q^*}(X) = -\sum_x p(x) \ln_{q^*} p(x), \tag{13}$$

has already been studied in a maximum (Tsallis) entropy setting (for example,
see Wada and Scarfone [27]). It is important to note that the $q^* = 2 - q$
duality has been studied within the Sharma-Taneja-Mittal framework by
Kaniadakis et al. [37]. The following properties, however, have yet to be proven
for dual Tsallis entropies: (a) the validity of (9) for dual Tsallis entropies, and
(b) the adherence of the dual Tsallis entropies to the chain rule [3, 12]. The
task is undertaken below.



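Theorem 1: The dual Tsallis joint entropy obeys the relation

$$S_{q^*}(X,\tilde{X}) = S_{q^*}(X) + S_{q^*}(\tilde{X}|X). \tag{14}$$

Note that $q^* = 2 - q$ and $\ln_{q^*} x = \frac{x^{1-q^*} - 1}{1-q^*}$.

Proof: From (5) and (6)

$$\begin{aligned}
S_q(X,\tilde{X}) &= -\sum_x \sum_{\tilde{x}} p(x,\tilde{x})^q \ln_q p(x,\tilde{x}) \\
&= -\sum_x \sum_{\tilde{x}} p(x,\tilde{x})^q \ln_q\left(p(x)p(\tilde{x}|x)\right) \\
&= -\sum_x \sum_{\tilde{x}} p(x,\tilde{x})^q \left[\ln_q p(x) + p(x)^{1-q} \ln_q p(\tilde{x}|x)\right] \\
&= \sum_x \sum_{\tilde{x}} p(x,\tilde{x}) \ln_q\left(\frac{1}{p(x)}\right) + \sum_x p(x) \sum_{\tilde{x}} p(\tilde{x}|x) \ln_q\left(\frac{1}{p(\tilde{x}|x)}\right) \\
\Rightarrow S_{q^*}(X,\tilde{X}) &= -\sum_x p(x) \ln_{q^*} p(x) - \sum_x \sum_{\tilde{x}} p(x,\tilde{x}) \ln_{q^*} p(\tilde{x}|x) \\
&= S_{q^*}(X) + S_{q^*}(\tilde{X}|X).
\end{aligned} \tag{15}$$

In conclusion, the dual Tsallis entropies acquire a form identical to the
Boltzmann-Gibbs-Shannon entropies, with $\ln_{q^*}(\cdot)$ replacing $\log(\cdot)$.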

Theorem 2: Let $X_1, X_2, X_3, ..., X_n$ be random variables obeying the probability
distribution $p(x_1, x_2, x_3, ..., x_n)$, then we have the chain rule

$$S_{q^*}(X_1, X_2, X_3, ..., X_n) = \sum_{i=1}^{n} S_{q^*}(X_i | X_{i-1}, ..., X_1). \tag{16}$$

Proof: Theorem 2 is proved by induction on n. Assuming (16) holds true for
some n, (15) yields

$$\begin{aligned}
S_{q^*}(X_1, X_2, X_3, ..., X_{n+1}) &= S_{q^*}(X_1, X_2, X_3, ..., X_n) + S_{q^*}(X_{n+1} | X_n, ..., X_1) \\
&= \sum_{i=1}^{n} S_{q^*}(X_i | X_{i-1}, ..., X_1) + S_{q^*}(X_{n+1} | X_n, ..., X_1),
\end{aligned} \tag{17}$$

which implies that (16) holds true for n + 1. Theorem 2 implies that the dual
Tsallis entropies can support a parametrically extended information theory.

Theorem 3: The dual convex generalized mutual entropy is described by (footnote 6)

$$\begin{aligned}
I_{q^*}(X;\tilde{X}) &= -\sum_x \sum_{\tilde{x}} p(x,\tilde{x}) \ln_{q^*}\left(\frac{p(x)p(\tilde{x})}{p(x,\tilde{x})}\right) \\
&\overset{q^* \to q}{=} S_q(X) + S_q(\tilde{X}) - S_q(X,\tilde{X}) = I_q(X;\tilde{X}); \quad 0 < q^* < 1, \text{ and, } q > 1.
\end{aligned} \tag{18}$$

Note that in this paper, $q^* \to q$ denotes re-parameterization of the
nonextensivity parameter defining the information theoretic quantity from q* to q by
setting $q^* = 2 - q$. Likewise, $q \to q^*$ denotes re-parameterization from q to q*
by setting $q = 2 - q^*$.

6 Here "$\to$" denotes a re-parameterization of the nonextensivity parameter, and is
not a limit.

Proof:

$$\begin{aligned}
I_{q^*}(X;\tilde{X}) &= -\sum_x \sum_{\tilde{x}} p(x,\tilde{x}) \ln_{q^*}\left(\frac{p(x)p(\tilde{x})}{p(x,\tilde{x})}\right) = -\sum_x \sum_{\tilde{x}} p(x,\tilde{x}) \ln_{q^*}\left(\frac{p(x)}{p(x|\tilde{x})}\right) \\
&\overset{(a)}{=} -\sum_x \sum_{\tilde{x}} p(x,\tilde{x}) \left(\ln_{q^*} p(x) + p(x)^{(1-q^*)} \ln_{q^*}\left(\frac{1}{p(x|\tilde{x})}\right)\right) \\
&\overset{(b)}{=} -\sum_x p(x) \ln_{q^*} p(x) - \sum_x \sum_{\tilde{x}} p(x)p(\tilde{x}|x)\, p(x)^{(q-1)} p(\tilde{x}|x)^{(q-1)} \ln_{q^*}\left(\frac{1}{p(x|\tilde{x})}\right) \\
&= -\frac{\sum_x p(x)^q - 1}{q-1} + \sum_x \sum_{\tilde{x}} p(x)^q p(\tilde{x}|x)^q \ln_q p(x|\tilde{x}) \\
\Rightarrow I_{q^*}(X;\tilde{X}) &\overset{q^* \to q}{=} S_q(X) - S_q(X|\tilde{X}) \\
&\overset{(c)}{=} S_q(X) + S_q(\tilde{X}) - S_q(X,\tilde{X}) \overset{q \to q^*}{=} I_{q^*}(\tilde{X};X).
\end{aligned} \tag{19}$$

Here, (a) follows from (5), (b) follows from setting $q^* = 2 - q$, and (c) follows
from (9). Theorem 3 acquires a certain significance especially when it may be
proven [11] that the convex form of the generalized mutual entropy (10) can
never be expressed in the form of Tsallis entropies (9).

Theorem 3 demonstrates that such a relation is indeed possible by commencing
with the q*-Tsallis mutual entropy and performing manipulations that scale the
generalized mutual entropy from q*-space to q-space, yielding a form akin
to (9). Interchanging the range of values and the connotations of q and q*
respectively, such that 0 < q < 1 (q* > 1), Theorem 3 may be modified to
justify defining the convex q-Tsallis mutual entropy by (10).



4 Nonextensive rate distortion theory and alternating minimization schemes

4.1 Overview of rate distortion theory

For a thorough exposition of RD theory, the interested reader is referred to
Section 13 in [3]. Let $X$ be a discrete random variable with a finite set of
possible values $\mathcal{X}$, distributed according to $p(x)$. Here, $\mathcal{X}$ is
the source alphabet. Let $\tilde{X}$ denote the reproduction alphabet (a compressed
representation of $X$). Let $\mathcal{X} = \{X_1, ..., X_n\}$ and
$\tilde{\mathcal{X}} = \{\tilde{X}_1, ..., \tilde{X}_m\}$, m < n. By definition, the
partitioning of $\mathcal{X}$ dictates the manner in which each element of the source
alphabet $X$ relates to each element of the compressed representation $\tilde{X}$. In
RD theory, the partitioning is dictated by the normalized transition probability
$p(\tilde{x}|x)$. When the assignment of the elements of $\mathcal{X}$ to
$\tilde{\mathcal{X}}$ is probabilistic, the partitioning is referred to as soft
partitioning. When the assignment is deterministic, it is referred to as hard
partitioning. Initially, RD undergoes a soft partitioning, and then as the process
progresses, hard partitioning occurs.

A standard measure that defines the quality of compression is the rate of a
code with respect to a channel transmitting between $X$ and $\tilde{X}$. In
generalized statistics, this quantity is the generalized mutual entropy
$I_q(X;\tilde{X})$ (footnote 7). The quantity $I_q(X;\tilde{X})$ is defined as the
compression information, which is evaluated on the basis of the joint probability
$p(x)p(\tilde{x}|x)$. Low values of $I_q(X;\tilde{X})$ imply a more compact
representation, and thus better compression. An extreme case would be where
$\tilde{X}$ has only one element (cardinality $|\tilde{\mathcal{X}}| = 1$), resulting
in $I_q(X;\tilde{X}) = 0$.

The physics underlying RD remains unchanged regardless of the range of q.
However, the generalized mutual entropy possesses different properties for
nonextensivity parameters in i) 0 < q < 1 and ii) q > 1, respectively. For example,
information theoretic and physics based features intrinsic to RD theory
that may not be comprehensively described using one range of the nonextensivity
parameter may be analyzed with greater coherence using the additive duality.
To obtain a deeper insight into the generic process of nonextensive RD, the
case of q > 1 is briefly examined.

This is a simple example where nonextensive models, having nonextensivity
parameters in the ranges i) 0 < q < 1 and ii) q > 1, may be employed to complement
each other. The models may help to understand the physics of certain problems by
employing the additive duality. In this case, the two different ranges of q are
employed with the aid of the additive duality to qualitatively and quantitatively
describe the extreme limits (scenarios) of the generalized RD model, using a
sender-receiver description. These extreme limits are: i) no communication between
$X$ and $\tilde{X}$, and ii) perfect communication between $X$ and $\tilde{X}$,
respectively.

Specifically, in Section 5.1 of this paper, the mathematical (quantitative)
analysis of the generalized RD model is performed in the range 0 < q < 1 using
the generalized mutual entropy described by (10). However, a deeper qualitative
description of the process may be obtained by complementing the form
of the generalized mutual entropy used in Section 5.1 with the form of the
generalized mutual entropy valid in the range q > 1 (described by (9)), with
the aid of the additive duality. As stated in Sections 1 and 2, the generalized
mutual entropy $I_q(X;\tilde{X})$ cannot be stated in the form described by (9) for
0 < q < 1. Employing the definition of the generalized mutual entropy for
q > 1 (9), we obtain $I_q(X;\tilde{X}) = S_q(X) - S_q(X|\tilde{X})$. Let $X$ be
inhabited by a sender and $\tilde{X}$ by a receiver. Here, $S_q(X)$ is the prior
uncertainty the receiver has about the sender's signal, which is diminished by
$S_q(X|\tilde{X})$ as the signal is received. The difference yields
$I_q(X;\tilde{X})$. As an example, in case there is no communication at all, then
$S_q(X|\tilde{X}) = S_q(X)$, and $I_q(X;\tilde{X}) = 0$.

7 Note that these arguments are adapted from the Boltzmann-Gibbs-Shannon RD
model, where q = 1.

Alternatively, if the communication channel is perfect and the received signal
$\tilde{X}$ is identical to the signal $X$ at the sender (the source alphabet is simply
copied as the reproduction alphabet), then $S_q(X|\tilde{X}) = 0$ and
$I_q(X;\tilde{X}) = S_q(X) = S_q(\tilde{X})$. In Boltzmann-Gibbs-Shannon statistics
(q = 1), this is called the Shannon upper bound [3]. In nonextensive statistics, this
quantity is hereafter referred to as the Tsallis upper bound.
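Two limiting codebooks make these statements concrete for q > 1: a reproduction alphabet with a single element, and a reproduction alphabet that copies the source exactly. The sketch below evaluates the entropic form (9) for both; the source distribution, the value of q, and the function names are illustrative assumptions.

```python
def tsallis_entropy(p, q):
    """Un-normalized Tsallis entropy (1), for q != 1; zero entries are skipped."""
    return (1.0 - sum(pi ** q for pi in p if pi > 0)) / (q - 1.0)

def mutual_entropy_entropic(joint, q):
    """Generalized mutual entropy in the form (9): S_q(X) + S_q(X~) - S_q(X, X~)."""
    px = [sum(row) for row in joint]
    pt = [sum(col) for col in zip(*joint)]
    flat = [pij for row in joint for pij in row]
    return tsallis_entropy(px, q) + tsallis_entropy(pt, q) - tsallis_entropy(flat, q)

px = [0.5, 0.3, 0.2]   # illustrative source distribution
q = 2.0                # a value in the range q > 1

# Every source symbol is mapped to the same reproduction symbol (|X~| = 1).
single_symbol = [[p] for p in px]

# The source alphabet is copied as the reproduction alphabet (perfect channel).
perfect_copy = [[px[i] if i == j else 0.0 for j in range(len(px))]
                for i in range(len(px))]

print(mutual_entropy_entropic(single_symbol, q))   # vanishes
print(mutual_entropy_entropic(perfect_copy, q),    # equals S_q(X),
      tsallis_entropy(px, q))                      # the Tsallis upper bound
```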

The compression information may always be reduced by using only a single
value of $\tilde{X}$, thereby ignoring the details in $X$. This requires an additional
constraint called the distortion measure. The distortion measure is denoted
by $d(x,\tilde{x})$ and is taken to be the Euclidean square distance for most problems
in science and engineering [3,6].

Given $d(x,\tilde{x})$, the partitioning of $X$ induced by $p(\tilde{x}|x)$ has an
expected distortion
$D = \langle d(x,\tilde{x}) \rangle_{p(x,\tilde{x})} = \sum_{x,\tilde{x}} p(x,\tilde{x})\, d(x,\tilde{x})$
(footnote 8). Note that D is mathematically equivalent to the internal energy in
statistical physics.
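As a worked illustration of the expected distortion, consider a four-symbol scalar source compressed to two reproduction values under the squared Euclidean distortion. All numerical values and names below are hypothetical and serve only to make the definition concrete, including the contrast between a hard and a soft partitioning.

```python
def expected_distortion(px, cond, source, repro):
    """D = sum over x, x~ of p(x) p(x~|x) d(x, x~), with squared Euclidean d."""
    return sum(px[i] * cond[i][j] * (source[i] - repro[j]) ** 2
               for i in range(len(source))
               for j in range(len(repro)))

source = [0.0, 1.0, 2.0, 3.0]      # values of the source alphabet
px = [0.25, 0.25, 0.25, 0.25]      # p(x)
repro = [0.5, 2.5]                 # values of the reproduction alphabet

# Hard partitioning: each x is assigned to exactly one x~.
cond_hard = [[1.0, 0.0], [1.0, 0.0], [0.0, 1.0], [0.0, 1.0]]

# Soft partitioning: each x is assigned probabilistically; rows are p(x~|x).
cond_soft = [[0.9, 0.1], [0.9, 0.1], [0.1, 0.9], [0.1, 0.9]]

d_hard = expected_distortion(px, cond_hard, source, repro)
d_soft = expected_distortion(px, cond_soft, source, repro)
print(d_hard)  # 0.25
print(d_soft)  # larger than d_hard for this example
```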

The RD function is [3, 4]

$$R_q(D) = \min_{p(\tilde{x}|x):\, \langle d(x,\tilde{x}) \rangle_{p(x,\tilde{x})} \leq D} I_q(X;\tilde{X}), \tag{20}$$

where $R_q(D)$ is the minimum of the compression information. This implies
that $R_q(D)$ is the minimum achievable nonextensive compression information,
where the minimization is carried out over all normalized transition probabilities
$p(\tilde{x}|x)$ for which the distortion constraint is satisfied. As depicted in
Fig. 1, $R_q(D)$ is a non-increasing convex function of D in the
distortion-compression plane. The $R_q(D)$ function separates the
distortion-compression plane into two regions. The region above the curve is known as
the rate distortion region, and corresponds to all achievable distortion-compression
pairs $\{D; I_q(X;\tilde{X})\}$.

On the other hand, the region below the curve is known as the non-achievable
region, where compression cannot occur. The major feature of nonextensive
RD models is that the RD curves inhabit the non-achievable region of those
obtained from Boltzmann-Gibbs-Shannon statistics. This implies that nonextensive
RD models can perform data compression in regimes not achievable by equivalent
extensive RD models.

8 Note that in this paper, $\langle \cdot \rangle_{p(\cdot)}$ denotes the expectation
with respect to the probability $p(\cdot)$.



Obtaining $R_q(D)$ involves minimization of the nonextensive RD Lagrangian
(free energy) [4]

$$L^q_{RD}\left[p(\tilde{x}|x)\right] = I_q(X;\tilde{X}) + \tilde{\beta} \langle d(x,\tilde{x}) \rangle_{p(x,\tilde{x})}, \tag{21}$$

subject to the normalization of the conditional probability. Here,
$\tilde{\beta} = q\beta$ (see Section 5.1) is a Lagrange multiplier called the
nonextensive trade-off parameter, where $\beta$ is the inverse temperature in
Boltzmann-Gibbs-Shannon statistics [5, 6]. Note that $\tilde{\beta}$ is the
nonextensive mathematical equivalent of the Boltzmann-Gibbs-Shannon inverse
temperature $\beta$. It is important to note that $\tilde{\beta}$ and $\beta$ have
different physical connotations. The definition $\tilde{\beta} = q\beta$ is specific
to this paper, and is chosen to facilitate comparison between the extensive
and nonextensive RD models. More complex forms of the nonextensive trade-off
parameter have been investigated. The results of this study will be
reported elsewhere.
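To make the trade-off in (21) concrete, the Lagrangian can be evaluated for two candidate codebooks of a small scalar source. The sketch below only evaluates (21) for fixed transition probabilities; it does not perform the minimization. The source, the candidate codebooks, the values of q and beta, and all function names are illustrative assumptions.

```python
def ln_q(x, q):
    """q-deformed logarithm (4), for q != 1."""
    return (x ** (1.0 - q) - 1.0) / (1.0 - q)

def compression_information(px, cond, q):
    """Generalized mutual entropy (10) for the joint p(x) p(x~|x)."""
    n, m = len(cond), len(cond[0])
    pt = [sum(px[i] * cond[i][j] for i in range(n)) for j in range(m)]
    total = 0.0
    for i in range(n):
        for j in range(m):
            pij = px[i] * cond[i][j]
            if pij > 0:  # zero-probability pairs do not contribute
                total -= pij * ln_q(px[i] * pt[j] / pij, q)
    return total

def expected_distortion(px, cond, source, repro):
    """Expected squared Euclidean distortion."""
    return sum(px[i] * cond[i][j] * (source[i] - repro[j]) ** 2
               for i in range(len(source)) for j in range(len(repro)))

def rd_lagrangian(px, cond, source, repro, q, beta):
    """Nonextensive RD Lagrangian (21) with trade-off parameter q * beta."""
    beta_tilde = q * beta
    return (compression_information(px, cond, q)
            + beta_tilde * expected_distortion(px, cond, source, repro))

source = [0.0, 1.0, 2.0, 3.0]
px = [0.25, 0.25, 0.25, 0.25]
q, beta = 0.8, 2.0

# Candidate 1: a single reproduction value (no information, high distortion).
l_single = rd_lagrangian(px, [[1.0]] * 4, source, [1.5], q, beta)

# Candidate 2: two reproduction values with a hard partitioning.
cond_two = [[1.0, 0.0], [1.0, 0.0], [0.0, 1.0], [0.0, 1.0]]
l_two = rd_lagrangian(px, cond_two, source, [0.5, 2.5], q, beta)

print(l_single, l_two)  # the two-value codebook has the lower Lagrangian here
```

For these particular values the reduction in distortion outweighs the increase in compression information; a smaller beta would shift the balance toward the single-value codebook.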

Here, (21) implies that RD theory is a trade-off between the compression
information and the expected distortion. Taking the variation of (21) over all
normalized distributions $p(\tilde{x}|x)$ yields

$$\begin{aligned}
\delta L^q_{RD}\left[p(\tilde{x}|x)\right] &= \delta I_q(X;\tilde{X}) + \tilde{\beta}\, \delta \langle d(x,\tilde{x}) \rangle_{p(x,\tilde{x})} = 0 \\
\Rightarrow \frac{\delta I_q(X;\tilde{X})}{\delta \langle d(x,\tilde{x}) \rangle_{p(x,\tilde{x})}} &= -\tilde{\beta}.
\end{aligned} \tag{22}$$

Here, (22) describes the rate of change of the generalized mutual entropy
with respect to the expected distortion along the RD curve: a tangent
drawn at any point on the RD curve has a slope $-\tilde{\beta}$. To prove that the
conditional distribution $p(\tilde{x}|x)$ represents a stationary point of
$L^q_{RD}\left[p(\tilde{x}|x)\right]$, (21) is subjected to a variational minimization
contingent on the normalization of $p(\tilde{x}|x)$. This procedure is detailed in
Section 5 of this paper, for both q- and q*-Tsallis mutual entropies.



4.2 The nonextensive alternating minimization scheme

The basis for the alternating minimization algorithm (a class of algorithms that
includes the Blahut-Arimoto scheme; see footnote 9) is to find the minimum distance
between two convex sets A and B in $\mathcal{R}^n$ (footnote 10). First, a point
$a \in A$ is chosen and a point $b \in B$ closest to it is found. This value of b is
fixed and its closest point in A is then found. The above process is repeated till the
algorithm converges to a (global) minimum distance.

9 See Chapter 13 in [3].

10 Note that the nonextensive alternating minimization algorithm presented herein is
not referred to as the nonextensive Blahut-Arimoto algorithm, despite being an obvious
extension, because the Blahut-Arimoto scheme is synonymous with
Boltzmann-Gibbs-Shannon statistics.

Extrapolating the Csiszár-Tusnády theory [9] to the nonextensive domain
for two convex sets of probability distributions, and considering the generalized
mutual entropy (10) as a distance measure between the joint probability
$p(x)p(\tilde{x}|x)$ and the marginal probabilities $p(x)$ and $p(\tilde{x})$ [3, 9],
by extending the definition of the generalized K-Ld (3), the nonextensive alternating
minimization algorithm converges to the minimum $D^q_{K-L}[\cdot]$ between
the two convex sets of probability distributions ($p(\tilde{x}|x)$ and $p(\tilde{x})$,
respectively). It is important to note that the nonextensive Pythagorean identity
(triangular equality), which forms the basis of any extension of the Csiszár-Tusnády
theory to the nonextensive regime, has been established by Dukkipati et al. [38].

Before proceeding any further, it is judicious to state the leitmotif of this
Section. The procedure behind the alternating minimization algorithm described
herein assumes an a-priori minimization of the nonextensive RD Lagrangian
(21), with respect to conditional probabilities $p(\tilde{x}|x)$, for all normalized
$p(\tilde{x}|x)$, using the calculus of variations. This variational minimization
yields a canonical conditional probability $p(\tilde{x}|x)$. The variational
minimization procedure is presented in Section 5.

Here, $p(\tilde{x}|x)$ corresponds to the joint probability
$p(x,\tilde{x}) = p(x)p(\tilde{x}|x)$, which is employed to evaluate the expected
distortion $D = \langle d(x,\tilde{x}) \rangle_{p(x,\tilde{x})}$. It is important to
prove that $p(\tilde{x})$ is a marginal probability (or marginal) of $p(x,\tilde{x})$.
This criterion ensures that extremization with respect to $p(\tilde{x})$ further
minimizes (21). Section 5.3 provides a discussion of the nonextensive alternating
minimization algorithm from a practitioner's viewpoint. Now, a result that
establishes the positivity condition for nonextensive alternating minimization
schemes, central to RD theory, is proven. Subsequently, this result is extended
to establish the positivity condition for nonextensive alternating maximization
schemes, required for calculating the channel capacity (Lemma 13.8.1 in [3]).

Lemma 1: Let $p(x)p(\tilde{x}|x)$ be a given joint distribution. The prior distribution
$p(\tilde{x})$ that minimizes
$D^q_{K-L}\left[p(X)p(\tilde{X}|X) \| p(X)p(\tilde{X})\right]$ is the marginal
distribution $p^*(\tilde{x})$ corresponding to $p(\tilde{x}|x)$, i.e.

$$D^q_{K-L}\left[p(X)p(\tilde{X}|X) \| p(X)p^*(\tilde{X})\right] = \min_{p(\tilde{x})} D^q_{K-L}\left[p(X)p(\tilde{X}|X) \| p(X)p(\tilde{X})\right], \tag{23}$$
where

$$p^*(\tilde{x}) = \sum_x p(x)p(\tilde{x}|x). \tag{24}$$

Also

$$-\sum_{x,\tilde{x}} p(x)p(\tilde{x}|x) \ln_q\left(\frac{p(x)}{p^*(x|\tilde{x})}\right) = -\max_{p(x|\tilde{x})} \sum_{x,\tilde{x}} p(x)p(\tilde{x}|x) \ln_q\left(\frac{p(x)}{p(x|\tilde{x})}\right), \tag{25}$$

where by the Bayes theorem

$$p^*(x|\tilde{x}) = \frac{p(x)p(\tilde{x}|x)}{\sum_x p(x)p(\tilde{x}|x)}. \tag{26}$$

Note that (25) forms the basis for the computational implementation of the
channel capacity within the Tsallis statistics framework.



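Proof:

$$\begin{aligned}
& D^q_{K-L}\left[p(x,\tilde{x}) \| p(x)p(\tilde{x})\right] - D^q_{K-L}\left[p(x,\tilde{x}) \| p(x)p^*(\tilde{x})\right] \\
&= -\sum_{x,\tilde{x}} p(x,\tilde{x}) \ln_q\left(\frac{p(x)p(\tilde{x})}{p(x,\tilde{x})}\right) + \sum_{x,\tilde{x}} p(x,\tilde{x}) \ln_q\left(\frac{p(x)p^*(\tilde{x})}{p(x,\tilde{x})}\right) \\
&\overset{(a)}{=} -\sum_{x,\tilde{x}} p(x,\tilde{x}) \frac{\ln_q\left(\frac{p(x)p(\tilde{x})}{p(x,\tilde{x})}\right) - \ln_q\left(\frac{p(x)p^*(\tilde{x})}{p(x,\tilde{x})}\right)}{1 + (1-q) \ln_q\left(\frac{p(x)p^*(\tilde{x})}{p(x,\tilde{x})}\right)} \left[1 + (1-q) \ln_q\left(\frac{p(x)p^*(\tilde{x})}{p(x,\tilde{x})}\right)\right] \\
&= -\sum_{x,\tilde{x}} p(x,\tilde{x}) \ln_q\left(\frac{p(\tilde{x})}{p^*(\tilde{x})}\right) \left[1 + (1-q) \ln_q\left(\frac{p(x)p^*(\tilde{x})}{p(x,\tilde{x})}\right)\right] \\
&\overset{(b)}{=} -\sum_{x,\tilde{x}} p(x,\tilde{x}) \ln_q\left(\frac{p(\tilde{x})}{p^*(\tilde{x})}\right) \frac{\sum_{x,\tilde{x}} p(x,\tilde{x}) + (1-q) \sum_{x,\tilde{x}} p(x,\tilde{x}) \ln_q\left(\frac{p(x)p^*(\tilde{x})}{p(x,\tilde{x})}\right)}{\sum_{x,\tilde{x}} p(x,\tilde{x})} \\
&\overset{(c)}{=} -\sum_{\tilde{x}} p^*(\tilde{x}) \ln_q\left(\frac{p(\tilde{x})}{p^*(\tilde{x})}\right) \left\{1 + (q-1)\left[-\sum_{x,\tilde{x}} p(x,\tilde{x}) \ln_q\left(\frac{p(x)p^*(\tilde{x})}{p(x,\tilde{x})}\right)\right]\right\} \\
&= D^q_{K-L}\left[p^*(\tilde{x}) \| p(\tilde{x})\right] \times \left\{1 + (q-1)\, D^q_{K-L}\left[p(X,\tilde{X}) \| p(X)p^*(\tilde{X})\right]\right\} \geq 0; \\
& 0 < q < 1, \quad 0 < D^q_{K-L}\left[p(X;\tilde{X}) \| p(X)p^*(\tilde{X})\right] < 1, \quad 0 < D^q_{K-L}\left[p^*(\tilde{x}) \| p(\tilde{x})\right] < 1.
\end{aligned} \tag{27}$$

Here, (a) sets up the expression to employ the q-deformed algebra definition
$\frac{\ln_q a - \ln_q b}{1 + (1-q) \ln_q b} = \ln_q\left(\frac{a}{b}\right)$ by multiplying
and dividing by
$\left[1 + (1-q) \ln_q\left(\frac{p(x)p^*(\tilde{x})}{p(x,\tilde{x})}\right)\right]$,
(b) follows by multiplying and dividing
$\left[1 + (1-q) \ln_q\left(\frac{p(x)p^*(\tilde{x})}{p(x,\tilde{x})}\right)\right]$ by
$p(x,\tilde{x})$ and summing over $x$ and $\tilde{x}$. Note that
$\sum_{x,\tilde{x}} p(x,\tilde{x}) = 1$, (c) establishes the positivity condition and
proves that

$$p^*(\tilde{x}) = \sum_x p(x)p(\tilde{x}|x). \tag{28}$$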

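The second part of Lemma 1 is proven as follows, based on the validity of
(24). From (24) and (26)

$$p(x)p(\tilde{x}|x) = p(\tilde{x})p^*(x|\tilde{x}). \qquad (29)$$

Thus, the positivity condition is established as

$$
\begin{aligned}
& -\sum_{x,\tilde{x}} p(x)p(\tilde{x}|x)\ln_q\left(\frac{p(x)}{p^*(x|\tilde{x})}\right) + \max_{p(x|\tilde{x})}\sum_{x,\tilde{x}} p(x)p(\tilde{x}|x)\ln_q\left(\frac{p(x)}{p(x|\tilde{x})}\right) \\
&\overset{(a)}{=} -\sum_{x,\tilde{x}} p(x)p(\tilde{x}|x)\left[\ln_q\left(\frac{p(x)}{p^*(x|\tilde{x})}\right) - \ln_q\left(\frac{p(x)}{p(x|\tilde{x})}\right)\right] \\
&= -\sum_{x,\tilde{x}} p(x)p(\tilde{x}|x)\ln_q\left(\frac{p(x|\tilde{x})}{p^*(x|\tilde{x})}\right)\left[1+(1-q)\ln_q\left(\frac{p(x)}{p(x|\tilde{x})}\right)\right] \\
&= -\sum_{x,\tilde{x}} p(x)p(\tilde{x}|x)\ln_q\left(\frac{p(x|\tilde{x})}{p^*(x|\tilde{x})}\right)\frac{\sum_{x,\tilde{x}} p(x,\tilde{x}) + (1-q)\sum_{x,\tilde{x}} p(x,\tilde{x})\ln_q\left(\frac{p(x)p(\tilde{x})}{p(x,\tilde{x})}\right)}{\sum_{x,\tilde{x}} p(x,\tilde{x})} \\
&\overset{(b)}{=} -\sum_{x,\tilde{x}} p(\tilde{x})p^*(x|\tilde{x})\ln_q\left(\frac{p(x|\tilde{x})}{p^*(x|\tilde{x})}\right)\left[\sum_{x,\tilde{x}} p(x,\tilde{x}) + (1-q)\sum_{x,\tilde{x}} p(x,\tilde{x})\ln_q\left(\frac{p(x)p(\tilde{x})}{p(x,\tilde{x})}\right)\right] \\
&= \sum_{\tilde{x}} p(\tilde{x})\left(-\sum_x p^*(x|\tilde{x})\ln_q\left(\frac{p(x|\tilde{x})}{p^*(x|\tilde{x})}\right)\right)\left[1+(q-1)\left(-\sum_{x,\tilde{x}} p(x,\tilde{x})\ln_q\left(\frac{p(x)p(\tilde{x})}{p(x,\tilde{x})}\right)\right)\right] \\
&\overset{(c)}{=} \sum_{\tilde{x}} p(\tilde{x})\,D^q_{K-L}\left[p^*(X|\tilde{X}=\tilde{x})\,\|\,p(X|\tilde{X}=\tilde{x})\right]\left\{1+(q-1)D^q_{K-L}\left[p(X,\tilde{X})\,\|\,p(X)p(\tilde{X})\right]\right\} > 0;
\end{aligned}
\qquad (30)
$$

$$0 < q < 1, \qquad 0 < D^q_{K-L}\left[p(X;\tilde{X})\,\|\,p(X)p(\tilde{X})\right] < 1, \qquad 0 < D^q_{K-L}\left[p^*(X|\tilde{X}=\tilde{x})\,\|\,p(X|\tilde{X}=\tilde{x})\right] < 1.$$

Here, (a) sets up the expression to employ the q-deformed algebra definition $\frac{\ln_q a - \ln_q b}{1+(1-q)\ln_q b} = \ln_q\left(\frac{a}{b}\right)$, by multiplying and dividing by $\left[1+(1-q)\ln_q\left(\frac{p(x)}{p(x|\tilde{x})}\right)\right]$; (b) evokes (29) followed by multiplying and dividing $1+(1-q)\ln_q\left(\frac{p(x)p(\tilde{x})}{p(x,\tilde{x})}\right)$ by $p(x,\tilde{x})$ and summing over $x$ and $\tilde{x}$. Note that $\sum_{x,\tilde{x}} p(x,\tilde{x}) = 1$; (c) establishes the positivity condition.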
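Step (a) in both parts of the proof rests on the q-deformed difference rule for the Tsallis logarithm. The minimal Python sketch below is a numerical sanity check of that rule only, not of Lemma 1 itself. It assumes the standard definition $\ln_q(x) = (x^{1-q}-1)/(1-q)$; the function names `ln_q` and `ln_q_ratio` and the sample values are illustrative and do not come from the paper.

```python
import math


def ln_q(x, q):
    """Tsallis q-logarithm (x**(1-q) - 1)/(1-q); the natural log when q == 1."""
    if abs(q - 1.0) < 1e-12:
        return math.log(x)
    return (x ** (1.0 - q) - 1.0) / (1.0 - q)


def ln_q_ratio(a, b, q):
    """ln_q(a/b) obtained through the q-deformed difference rule."""
    return (ln_q(a, q) - ln_q(b, q)) / (1.0 + (1.0 - q) * ln_q(b, q))


q = 0.7               # any value in 0 < q < 1 works here
a, b = 0.35, 0.2      # arbitrary positive arguments
direct = ln_q(a / b, q)
via_rule = ln_q_ratio(a, b, q)
print(direct, via_rule)   # the two values agree up to rounding
```

For q = 1 the denominator in `ln_q_ratio` equals one and the rule reduces to the ordinary identity ln(a/b) = ln a - ln b.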

Here, the two parts of Lemma 1 ((23) and (25)) provide a generalized statistics
framework for deriving both alternating minimization and alternating maximization
schemes, using the celebrated Csiszár-Tusnády theory. Note that the
results in Boltzmann-Gibbs-Shannon statistics are obtained by setting $q = 1$
(see Lemma 13.8.1 in [3]).

Lemma 1 is of great importance in deriving generalized statistics extensions 
of the Expectation Maximization (EM) algorithm [39]. The complete efficacy 
of Lemma 1 will be demonstrated in a future publication, which studies the 
information bottleneck method [7] within the framework of Tsallis statistics. 

The positivity conditions (27) and (30) cannot be derived without the aid of
q-deformed algebra [32]. It is important to note that the first part of Lemma
1 establishes the fact that the marginal probability $p^*(\tilde{x})$, defined by (24),
minimizes the generalized mutual entropy and hence the generalized statistics
RD Lagrangian (21), for the range $0 < q < 1$. Any form of marginal
probability other than (24), which may be obtained without the use of
q-deformed algebra, could have a two-fold debilitating effect on the generalized
statistics RD model presented in this paper, and on the results of Lemma 1. First,
Bayes' theorem (26) would be violated. Next, the possibility exists wherein
the RD Lagrangian (21) could be maximized instead of being minimized, for
certain values of q.

Specifically, if the rules of q-algebra are neglected, an expression of the form
$p^*(\tilde{x}) = \sum_x p(x)^{\alpha(q)} p(\tilde{x}|x)^{\beta(q)}$, where $\alpha(q)$ and $\beta(q)$ are some functions of the
nonextensivity parameter q, is obtained. Apart from violating (26), such a
form of $p^*(\tilde{x})$ could result in a maximization of (21) for certain values of q,
thereby invalidating the nonextensive alternating minimization procedure.

Note that the minimum of $D^q_{K-L}[p(X)p(\tilde{X}|X)\,\|\,p(X)p(\tilde{X})]$ is exactly the convex generalized
mutual entropy $I_q(X;\tilde{X})$ calculated on the basis of the joint distribution
$p(x)p(\tilde{x}|x)$. Thus, $D^q_{K-L}[p(X)p(\tilde{X}|X)\,\|\,p(X)p(\tilde{X})]$ is an upper bound for
the compression information term $I_q(X;\tilde{X})$, with equality achieved only when
$p(\tilde{x})$ is set to the marginal distribution of $p(x)p(\tilde{x}|x)$. The above proposition
encourages the casting of the generalized RD function as a double minimization

$$R_q(D) = \min_{\{p(\tilde{x})\}}\ \min_{\{p(\tilde{x}|x):\,\langle d(x,\tilde{x})\rangle \le D\}} D^q_{K-L}\left[p(X)p(\tilde{X}|X)\,\|\,p(X)p(\tilde{X})\right]. \qquad (31)$$

Given A, a set of joint distributions $p(x,\tilde{x})$ with marginal $p(x)$ that satisfy
the distortion constraint, and B, the set of product distributions $p(x)p(\tilde{x})$
with some normalized $p(\tilde{x})$, then

$$R_q(D) = \min_{b \in B}\ \min_{a \in A} D^q_{K-L}\left[a\,\|\,b\right]. \qquad (32)$$

Note that, like the Blahut-Arimoto algorithm, the nonextensive alternating
minimization scheme does not possess a unique solution. Extension of the
above theory to the case of the dual generalized mutual entropy (parameterized
by $q^* = 2 - q$) is identical and straightforward.
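To make the quantity appearing in (31) concrete, the sketch below evaluates the generalized K-Ld between a joint distribution $p(x)p(\tilde{x}|x)$ and the product $p(x)p(\tilde{x})$, with $p(\tilde{x})$ set to the marginal (24). It assumes the standard Tsallis q-logarithm; the two-letter source, the transition matrix, and all function names are hypothetical choices made for illustration only.

```python
import math


def ln_q(x, q):
    """Tsallis q-logarithm; the natural log when q == 1."""
    if abs(q - 1.0) < 1e-12:
        return math.log(x)
    return (x ** (1.0 - q) - 1.0) / (1.0 - q)


def generalized_kl(p, r, q):
    """D^q_{K-L}[p || r] = -sum_i p_i ln_q(r_i / p_i), skipping p_i = 0."""
    return -sum(pi * ln_q(ri / pi, q) for pi, ri in zip(p, r) if pi > 0.0)


def generalized_mutual_entropy(p_x, cond, q):
    """I_q(X; X~) for the joint p(x) p(x~|x), using the marginal of Eq. (24)."""
    n, k = len(p_x), len(cond[0])
    p_t = [sum(p_x[i] * cond[i][j] for i in range(n)) for j in range(k)]
    joint = [p_x[i] * cond[i][j] for i in range(n) for j in range(k)]
    product = [p_x[i] * p_t[j] for i in range(n) for j in range(k)]
    return generalized_kl(joint, product, q)


p_x = [0.5, 0.5]                      # toy source distribution p(x)
cond = [[0.9, 0.1], [0.2, 0.8]]       # toy transition matrix p(x~|x), rows sum to 1
i_q = generalized_mutual_entropy(p_x, cond, q=0.7)
i_shannon = generalized_mutual_entropy(p_x, cond, q=1.0)
print(i_q, i_shannon)
```

With q = 1 the function returns the Shannon mutual information of the same joint distribution, and it returns zero whenever every row of `cond` is identical, that is, when $\tilde{X}$ is independent of $X$.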



5 Nonextensive rate distortion variational principles 



This Section closely parallels the approach followed in Section 13.7 of [3]. 



5.1 Case for 0 < q < 1



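Lemma 2: Variational minimization of the Lagrangian

$$L^q_{RD}\left[p(\tilde{x}|x)\right] = \sum_x\sum_{\tilde{x}} p(x)p(\tilde{x}|x)\frac{\left(\frac{p(\tilde{x}|x)}{p(\tilde{x})}\right)^{q-1} - 1}{q-1} + \tilde{\beta}\sum_x\sum_{\tilde{x}} d(x,\tilde{x})p(x)p(\tilde{x}|x) + \sum_x \lambda(x)\sum_{\tilde{x}} p(\tilde{x}|x), \qquad (33)$$

yields the canonical conditional probability $p(\tilde{x}|x) = \frac{p(\tilde{x})\exp_{q^*}\left(-\beta^*(x)d(x,\tilde{x})\right)}{Z(x,\beta^*(x))}$, where $\beta^*(x)$ and $Z(x,\beta^*(x))$ are defined in (38) and (40), respectively.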



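Valid for both discrete and continuous cases

Proof: Taking the variational derivative of (33) for each $x$ and $\tilde{x}$ yields

$$
\begin{aligned}
\frac{\delta L^q_{RD}\left[p(\tilde{x}|x)\right]}{\delta p(\tilde{x}|x)}
&= \frac{\partial}{\partial p(\tilde{x}|x)}\left[\frac{p(x)p(\tilde{x}|x)^q p(\tilde{x})^{1-q} - 1}{q-1}\right] + \tilde{\beta}d(x,\tilde{x})p(x) + \lambda(x) \\
&= \frac{q\,p(x)}{q-1}\left(\frac{p(\tilde{x}|x)}{p(\tilde{x})}\right)^{q-1} + \frac{p(x)p(\tilde{x}|x)^q}{q-1}\frac{\partial p(\tilde{x})^{1-q}}{\partial p(\tilde{x}|x)} + \tilde{\beta}d(x,\tilde{x})p(x) + \lambda(x) \\
&\overset{(a)}{=} \frac{q\,p(x)}{q-1}\left(\frac{p(\tilde{x}|x)}{p(\tilde{x})}\right)^{q-1} + \frac{p(x)p(\tilde{x}|x)^q}{q-1}\frac{\partial}{\partial p(\tilde{x}|x)}\left(\frac{p(\tilde{x}|x)^{1-q}p(x)^{1-q}}{p(x|\tilde{x})^{1-q}}\right) + \tilde{\beta}d(x,\tilde{x})p(x) + \lambda(x) \\
&\overset{(b)}{=} p(x)\left[\frac{1}{q-1}\left(\frac{p(\tilde{x}|x)}{p(\tilde{x})}\right)^{q-1} + \tilde{\beta}d(x,\tilde{x}) + \tilde{\lambda}(x)\right] = 0.
\end{aligned}
\qquad (34)
$$

In (34), (a) follows from Bayes rule by setting $p(\tilde{x}) = \frac{p(\tilde{x}|x)p(x)}{p(x|\tilde{x})}$, and (b) follows by defining $\tilde{\lambda}(x) = \frac{\lambda(x)}{p(x)}$. Thus, (34) yields

$$\left[\frac{1}{q-1}\left(\frac{p(\tilde{x}|x)}{p(\tilde{x})}\right)^{q-1} + \tilde{\beta}d(x,\tilde{x}) + \tilde{\lambda}(x)\right] = 0. \qquad (35)$$

Expanding (35) yields

$$p(\tilde{x}|x) = p(\tilde{x})\left[(1-q)\left\{\tilde{\lambda}(x) + \tilde{\beta}d(x,\tilde{x})\right\}\right]^{\frac{1}{q-1}}. \qquad (36)$$

Multiplying the square bracket in (35) by $p(\tilde{x}|x)$, summing over $\tilde{x}$, and evoking $\sum_{\tilde{x}} p(\tilde{x}|X=x) = 1$ yields

$$\tilde{\lambda}(x) = \frac{1}{1-q}\aleph_q(x) - \tilde{\beta}\langle d(x,\tilde{x})\rangle_{p(\tilde{x}|X=x)}. \qquad (37)$$

Here, $\aleph_q(x) = \sum_{\tilde{x}} p(\tilde{x})\left(\frac{p(\tilde{x}|x)}{p(\tilde{x})}\right)^q$. Thus (36) yields

$$
\begin{aligned}
p(\tilde{x}|x) &= \frac{p(\tilde{x})\left\{1-(q-1)\beta^*(x)d(x,\tilde{x})\right\}^{\frac{1}{q-1}}}{\Im_{RD}(x)^{\frac{1}{1-q}}}, \\
\Im_{RD}(x) &= \aleph_q(x) + (q-1)\tilde{\beta}\langle d(x,\tilde{x})\rangle_{p(\tilde{x}|X=x)} = (1-q)\tilde{\lambda}(x), \\
\beta^*(x) &= \frac{\tilde{\beta}}{\Im_{RD}(x)} = \frac{\tilde{\beta}}{\aleph_q(x) + (q-1)\tilde{\beta}\langle d(x,\tilde{x})\rangle_{p(\tilde{x}|X=x)}},
\end{aligned}
\qquad (38)
$$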



where $\beta^*(x)$ is the effective nonextensive trade-off parameter for a single source
alphabet $x \in X$. The net effective nonextensive trade-off parameter, evaluated
for all source alphabets $x \in X$, is $\beta^* = \sum_x \beta^*(x)$. Note that the parameter
$\beta^*(x)$ explicitly manifests the self-referential nature of the canonical transition
probability $p(\tilde{x}|x)$.

The present case differs from the analysis in [21] in the sense that $\beta^*(x)$ is to
be evaluated for each source alphabet $x \in X$. Thus, the so-called parametric
perspective employed in [21], by a-priori defining a range for $\beta^*(x) \in [0,\infty]$,
is not possible. Instead the canonical transition probability $p(\tilde{x}|x)$ has to be
evaluated for each $\beta^*(x)$ for all $x \in X$, and for each $\tilde{\beta}$. This feature represents
a pronounced qualitative and quantitative distinction when studying
the stationary point solutions of conditional probabilities, as compared with
stationary point solutions of marginal probabilities.

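Specifying $q = 2 - q^*$ in the numerator of (38) and evoking (4) yields the
canonical transition probability

$$p(\tilde{x}|x) = \frac{p(\tilde{x})\exp_{q^*}\left[-\beta^*(x)d(x,\tilde{x})\right]}{\sum_{\tilde{x}} p(\tilde{x})\exp_{q^*}\left[-\beta^*(x)d(x,\tilde{x})\right]} = \frac{p(\tilde{x})\exp_{q^*}\left[-\beta^*(x)d(x,\tilde{x})\right]}{Z(x,\beta^*(x))}. \qquad (39)$$

The partition function for a single source alphabet $x \in X$ is

$$Z(x,\beta^*(x)) = \Im_{RD}(x)^{\frac{1}{1-q}} = \sum_{\tilde{x}} p(\tilde{x})\exp_{q^*}\left[-\beta^*(x)d(x,\tilde{x})\right]. \qquad (40)$$

Solutions of (39) are only valid for $\left\{1-(1-q^*)\beta^*(x)d(x,\tilde{x})\right\} > 0$, ensuring
$p(\tilde{x}|x) > 0$. This is the Tsallis cut-off condition [1, 21]. The condition
$\left\{1-(1-q^*)\beta^*(x)d(x,\tilde{x})\right\} < 0$ requires setting $p(\tilde{x}|x) = 0$. From (38)-(40),
$\tilde{\beta}$, $\beta^*$, and $\beta^*(x)$ relate as

$$\beta^*(x) = \frac{\tilde{\beta}}{\aleph_q(x) + (q-1)\tilde{\beta}\langle d(x,\tilde{x})\rangle_{p(\tilde{x}|x)}} = \frac{\tilde{\beta}}{Z^{(1-q)}(x,\beta^*(x))}, \qquad \beta^* = \sum_x \beta^*(x). \qquad (41)$$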
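The canonical transition probability (39), the partition function (40), and the cut-off condition can be sketched for a single source letter as follows. This is a minimal illustration assuming the standard q-exponential $\exp_q(u) = [1+(1-q)u]^{1/(1-q)}$ and assuming that $\beta^*(x)$ has already been obtained from (41); the prior, the distortion values, and the names `exp_q` and `transition_row` are hypothetical.

```python
import math


def exp_q(u, q):
    """Tsallis q-exponential [1 + (1-q)u]^(1/(1-q)).

    Returns 0 when the bracket is not positive (the Tsallis cut-off),
    and the ordinary exponential when q == 1.
    """
    if abs(q - 1.0) < 1e-12:
        return math.exp(u)
    base = 1.0 + (1.0 - q) * u
    return base ** (1.0 / (1.0 - q)) if base > 0.0 else 0.0


def transition_row(p_t, d_row, beta_star, q):
    """p(x~|x) of Eq. (39) for one source letter x, with q* = 2 - q."""
    q_star = 2.0 - q
    weights = [pt * exp_q(-beta_star * dj, q_star) for pt, dj in zip(p_t, d_row)]
    z = sum(weights)                      # partition function Z(x, beta*(x)), Eq. (40)
    return [w / z for w in weights], z


p_t = [0.5, 0.3, 0.2]        # toy prior p(x~) over three representatives
d_row = [0.1, 1.0, 4.0]      # toy distortions d(x, x~) for one fixed x
row, z = transition_row(p_t, d_row, beta_star=0.8, q=0.7)
boltzmann, _ = transition_row(p_t, d_row, beta_star=0.8, q=1.0)
print(row, boltzmann)
```

In this toy case the row for q = 0.7 places more probability on the most distant representative than the Boltzmann-Gibbs-Shannon row obtained with q = 1, because the q-exponential with $q^* > 1$ decays as a power law rather than exponentially.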



5.2 Case for q > 1 



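As discussed in Sections 2 and 3 (Theorem 3) of this paper, use of the additive
duality is required to express the generalized mutual entropy for the range
of the nonextensivity parameter $q > 1$ as $D^{q^*}_{K-L}[p(X,\tilde{X})\,\|\,p(X)p(\tilde{X})]$. This is
done in accordance with the form required for the nonextensive alternating
minimization procedure (31) (Section 4.2), described in $q^*$-space, and the
convention followed in [3].

Lemma 3: Variational minimization of the Lagrangian$^{12}$

$$L^{q^*}_{RD}\left[p(\tilde{x}|x)\right] = \sum_x\sum_{\tilde{x}} p(x)p(\tilde{x}|x)\frac{\left(\frac{p(\tilde{x}|x)}{p(\tilde{x})}\right)^{q^*-1} - 1}{q^*-1} + \tilde{\beta}_{q^*}\sum_x\sum_{\tilde{x}} d(x,\tilde{x})p(x)p(\tilde{x}|x) + \sum_x \lambda(x)\sum_{\tilde{x}} p(\tilde{x}|x), \qquad (42)$$

yields the canonical conditional probability $p(\tilde{x}|x) = \frac{p(\tilde{x})\exp_q\left(-\beta^*_{2-q}(x)d(x,\tilde{x})\right)}{Z_{2-q}\left(x,\beta^*_{2-q}(x)\right)}$,
where $\tilde{\beta}_{q^*} = q^*\beta$, $\beta^*_{2-q}(x) = \frac{\tilde{\beta}_{q^*}}{(1-q^*)\tilde{\lambda}(x)}$, $\tilde{\lambda}(x) = \frac{\lambda(x)}{p(x)}$, and $\sum_x\sum_{\tilde{x}} p(x,\tilde{x}) = 1$.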



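Proof: In accordance with the procedure employed in Section 5.1, variational
minimization of the Lagrangian in (42) yields

$$\frac{\delta L^{q^*}_{RD}\left[p(\tilde{x}|x)\right]}{\delta p(\tilde{x}|x)} = p(x)\left[\frac{1}{q^*-1}\left(\frac{p(\tilde{x}|x)}{p(\tilde{x})}\right)^{q^*-1} + \tilde{\beta}_{q^*}d(x,\tilde{x}) + \tilde{\lambda}(x)\right] = 0. \qquad (43)$$

Solving (43) for $p(\tilde{x}|x)$ and obtaining the normalization Lagrange multiplier
analogous to the approach in Section 5.1 yields

$$p(\tilde{x}|x) = p(\tilde{x})\left[(1-q^*)\left\{\tilde{\lambda}(x) + \tilde{\beta}_{q^*}d(x,\tilde{x})\right\}\right]^{\frac{1}{q^*-1}}, \quad \text{and} \quad \tilde{\lambda}(x) = \frac{1}{1-q^*}\aleph_{q^*}(x) - \tilde{\beta}_{q^*}\langle d(x,\tilde{x})\rangle_{p(\tilde{x}|X=x)}. \qquad (44)$$

Here, (44) yields

$$
\begin{aligned}
p(\tilde{x}|x) &= \frac{p(\tilde{x})\exp_q\left(-\beta^*_{2-q}(x)d(x,\tilde{x})\right)}{\left(\Im^{q^*}_{RD}(x)\right)^{\frac{1}{1-q^*}}}, \\
\Im^{q^*}_{RD}(x) &= \aleph_{q^*}(x) + (q^*-1)\tilde{\beta}_{q^*}\langle d(x,\tilde{x})\rangle_{p(\tilde{x}|X=x)}, \\
\beta^*_{2-q}(x) &= \frac{\tilde{\beta}_{q^*}}{\Im^{q^*}_{RD}(x)} = \frac{\tilde{\beta}_{q^*}}{\aleph_{q^*}(x) + (q^*-1)\tilde{\beta}_{q^*}\langle d(x,\tilde{x})\rangle_{p(\tilde{x}|X=x)}}.
\end{aligned}
\qquad (45)
$$

Here, (45) yields

$$p(\tilde{x}|x) = \frac{p(\tilde{x})\exp_q\left(-\beta^*_{2-q}(x)d(x,\tilde{x})\right)}{Z_{2-q}\left(x,\beta^*_{2-q}(x)\right)}. \qquad (46)$$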



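12 Valid for both discrete and continuous cases

Solutions of (46) are only valid for $\left\{1-(1-q)\beta^*_{2-q}(x)d(x,\tilde{x})\right\} > 0$, ensuring
$p(\tilde{x}|x) > 0$. This is the Tsallis cut-off condition [1, 21] for (46), for the range
$1 < q < 2$. The condition $\left\{1-(1-q)\beta^*_{2-q}(x)d(x,\tilde{x})\right\} < 0$ requires setting
$p(\tilde{x}|x) = 0$. The partition function is

$$Z_{2-q}\left(x,\beta^*_{2-q}(x)\right) = \left(\Im^{q^*}_{RD}(x)\right)^{\frac{1}{1-q^*}} = \sum_{\tilde{x}} p(\tilde{x})\exp_q\left[-\beta^*_{2-q}(x)d(x,\tilde{x})\right]. \qquad (47)$$

From (45)-(47), $\tilde{\beta}_{q^*}$, $\beta^*_{2-q}$, and $\beta^*_{2-q}(x)$ relate as

$$\beta^*_{2-q}(x) = \frac{\tilde{\beta}_{q^*}}{\aleph_{q^*}(x) + (q^*-1)\tilde{\beta}_{q^*}\langle d(x,\tilde{x})\rangle_{p(\tilde{x}|x)}} = \frac{\tilde{\beta}_{q^*}}{Z^{(1-q^*)}_{2-q}\left(x,\beta^*_{2-q}(x)\right)}, \qquad \beta^*_{2-q} = \sum_x \beta^*_{2-q}(x). \qquad (48)$$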
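Section 5.2 relies throughout on the additive duality $q^* = 2 - q$. Under the standard definitions of $\ln_q$ and $\exp_q$ (assumed here; the paper's own statement of the duality is its Eq. (4), which lies outside this Section), the duality amounts to $\ln_q(1/x) = -\ln_{2-q}(x)$ and $\exp_q(u)\exp_{2-q}(-u) = 1$. The short Python check below uses arbitrary sample values and illustrative function names.

```python
import math


def ln_q(x, q):
    """Tsallis q-logarithm; the natural log when q == 1."""
    if abs(q - 1.0) < 1e-12:
        return math.log(x)
    return (x ** (1.0 - q) - 1.0) / (1.0 - q)


def exp_q(u, q):
    """Tsallis q-exponential with cut-off; the ordinary exponential when q == 1."""
    if abs(q - 1.0) < 1e-12:
        return math.exp(u)
    base = 1.0 + (1.0 - q) * u
    return base ** (1.0 / (1.0 - q)) if base > 0.0 else 0.0


q = 1.4                  # a value in the range 1 < q < 2 treated in Section 5.2
q_star = 2.0 - q         # dual parameter, here 0.6
x, u = 0.37, 0.9         # arbitrary test points inside the cut-off region

lhs = ln_q(1.0 / x, q)
rhs = -ln_q(x, q_star)
product = exp_q(u, q) * exp_q(-u, q_star)
print(lhs, rhs, product)   # lhs and rhs agree; product equals 1 up to rounding
```

The same relation is what allows the numerator of (38), written in terms of q, to be re-expressed as a q-exponential in $q^*$ in (39).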



5.3 Nonextensive alternating minimization algorithm revisited 



This sub-Section describes the practical implementation of the nonextensive
alternating minimization algorithm for the case $0 < q < 1$. This is accomplished
using the theory presented in Sections 4.1 and 5.1. For the sake of
brevity, the implementation is described in point form:

• A-priori specifying the nonextensivity parameter q and the effective nonextensive
trade-off parameter for a single source alphabet $\beta^*(x)$ obtained from
(38) for all source alphabets, the expected distortion $D = \langle d(x,\tilde{x})\rangle_{p(x,\tilde{x})}$ is
obtained. Choosing a random data point in $\tilde{X}$ (the convex set of probability
distributions B, described in Sections 4.1 and 4.2), an initial guess for $p(\tilde{x})$ is
made. Eq. (39) is then employed to evaluate the transition probability $p(\tilde{x}|x)$
(a single data point in the convex set of probability distributions A, described
in Sections 4.1 and 4.2) that minimizes the generalized K-Ld $D^q_{K-L}[\bullet]$ subject
to the distortion constraint.

• Using this value of $p(\tilde{x}|x)$, (28) is employed to calculate a new value of $p(\tilde{x})$
that further minimizes $D^q_{K-L}[\bullet]$.

• The above process is repeated, thereby monotonically reducing the right
hand side of (31). Using Lemma 2 ((39)) and (21), the algorithm is seen
to converge to a unique point on the RD curve whose slope equals $-\tilde{\beta}$. In
principle, for different values of $\beta^*(x)$ obtained for all source alphabets, a full
RD curve may be obtained.

• Note that the alternating minimization is performed independently in the
two convex sets of probability distributions A and B (see Section 4.2). Specifically,
$p(\tilde{x})$ is assumed fixed when minimizing with respect to $p(\tilde{x}|x)$. In the
next update step, assuming $p(\tilde{x}|x)$ to be fixed, the minimization with respect to $p(\tilde{x})$ is performed through (28).

• In general, the alternating minimization algorithm only deals with the optimal
partitioning of $X$ (induced by $p(\tilde{x}|x)$), with respect to a fixed set of
representatives ($\tilde{X}$ values). This implies that the distortion measure $d(x,\tilde{x})$
is pre-defined and fixed throughout the implementation $\forall x \in X$ and $\forall \tilde{x} \in \tilde{X}$.

The alternating minimization algorithm for nonextensive RD theory is described
in Algorithm 1 for $0 < q < 1$.

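Algorithm 1 Nonextensive Alternating Minimization Scheme for $0 < q < 1$

Input

1. Source distribution $p(x) \in X$.

2. Set of representatives of quantized codebook given by $p(\tilde{x}) \in \tilde{X}$ values.

3. Input nonextensive trade-off parameter $\tilde{\beta}\,(= q\beta)$, where $\beta$ is the
Boltzmann-Gibbs-Shannon inverse temperature.

4. Distortion measure $d(x,\tilde{x})$.

5. Convergence parameter $\varepsilon$.

Output

Value of $R_q(D)$ where its slope equals $-\tilde{\beta} = -q\beta$.

Initialization

Initialize $\tilde{\beta}$ and randomly initialize $p(\tilde{x})$ and $p(\tilde{x}|x)$ (to initialize $\beta^{*(0)}(x)$).

While True

• $\beta^{*(m)}(x) = \frac{\tilde{\beta}}{\aleph^{(m)}_q(x) + (q-1)\tilde{\beta}\langle d(x,\tilde{x})\rangle_{p^{(m)}(\tilde{x}|x)}}$ (effective nonextensive trade-off parameter for a single source alphabet, Eq. (41))

• $p^{(m+1)}(\tilde{x}|x) \leftarrow \frac{p^{(m)}(\tilde{x})\exp_{q^*}\left(-\beta^{*(m)}(x)d(x,\tilde{x})\right)}{Z^{(m+1)}\left(x,\beta^{*(m)}(x)\right)}$

• $p^{(m+1)}(\tilde{x}) \leftarrow \sum_x p(x)p^{(m+1)}(\tilde{x}|x)$

• $R^{(m+1)}_q(D) = D^q_{K-L}\left[p(x)p^{(m+1)}(\tilde{x}|x)\,\|\,p(x)p^{(m+1)}(\tilde{x})\right]$

• If $\left(R^{(m)}_q(D) - R^{(m+1)}_q(D)\right) \le \varepsilon$, Break

• Test Tsallis cut-off condition.

• $\tilde{\beta} \leftarrow \tilde{\beta} + \delta\tilde{\beta}$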
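The inner loop of Algorithm 1 can be sketched in Python for a single value of $\tilde{\beta}$ as shown below. This is a minimal illustration under stated assumptions, not the authors' implementation: $p(\tilde{x})$ is initialized uniformly and $p(\tilde{x}|x)$ randomly, the stopping test uses the absolute change in the rate, a non-positive denominator in (41) is treated as the point at which to stop, the outer sweep over $\tilde{\beta}$ is omitted, and the four-point data set and all function names are hypothetical.

```python
import math
import random


def exp_q(u, q):
    """Tsallis q-exponential with cut-off; the ordinary exponential when q == 1."""
    if abs(q - 1.0) < 1e-12:
        return math.exp(u)
    base = 1.0 + (1.0 - q) * u
    return base ** (1.0 / (1.0 - q)) if base > 0.0 else 0.0


def generalized_kl(p, r, q):
    """Generalized K-Ld, -sum_i p_i ln_q(r_i / p_i), skipping terms with p_i = 0."""
    if abs(q - 1.0) < 1e-12:
        return sum(pi * math.log(pi / ri) for pi, ri in zip(p, r) if pi > 0.0)
    return sum(pi * ((pi / ri) ** (q - 1.0) - 1.0)
               for pi, ri in zip(p, r) if pi > 0.0) / (q - 1.0)


def alternating_minimization(p_x, d, q, beta, eps=1e-10, max_iter=1000, seed=0):
    """One run of the alternating updates at fixed beta~ = q * beta."""
    rng = random.Random(seed)
    n, k = len(d), len(d[0])
    beta_t = q * beta
    p_t = [1.0 / k] * k                                   # assumed uniform start for p(x~)
    cond = []
    for _ in range(n):                                    # random start for p(x~|x)
        raw = [rng.random() + 1e-3 for _ in range(k)]
        cond.append([v / sum(raw) for v in raw])
    rate_prev = float("inf")
    for _ in range(max_iter):
        new_cond = []
        for i in range(n):
            aleph = sum(p_t[j] * (cond[i][j] / p_t[j]) ** q for j in range(k))
            mean_d = sum(cond[i][j] * d[i][j] for j in range(k))
            denom = aleph + (q - 1.0) * beta_t * mean_d   # denominator of Eq. (41)
            if denom <= 0.0:                              # assumed stopping guard
                raise ValueError("beta*(x) is not positive; stop the run here")
            beta_star = beta_t / denom
            w = [p_t[j] * exp_q(-beta_star * d[i][j], 2.0 - q) for j in range(k)]
            z = sum(w)                                    # partition function, Eq. (40)
            new_cond.append([wj / z for wj in w])         # Eq. (39)
        cond = new_cond
        p_t = [sum(p_x[i] * cond[i][j] for i in range(n)) for j in range(k)]  # Eq. (28)
        joint = [p_x[i] * cond[i][j] for i in range(n) for j in range(k)]
        product = [p_x[i] * p_t[j] for i in range(n) for j in range(k)]
        rate = generalized_kl(joint, product, q)
        if abs(rate_prev - rate) <= eps:
            break
        rate_prev = rate
    dist = sum(p_x[i] * cond[i][j] * d[i][j] for i in range(n) for j in range(k))
    return cond, p_t, rate, dist


xs = [0.0, 0.2, 1.0, 1.2]                  # toy one-dimensional source points
reps = [0.1, 1.1]                          # toy fixed representatives
d = [[(x - r) ** 2 for r in reps] for x in xs]
cond, p_t, rate, dist = alternating_minimization([0.25] * 4, d, q=0.7, beta=0.5)
print(rate, dist)
```

Calling the function for an increasing sequence of `beta` values, and stopping at the first run that raises, would trace out a truncated curve of (`dist`, `rate`) pairs in the manner described for Fig. 2. With `q=1.0` each update in this sketch reduces to the usual Blahut-Arimoto step.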



6 Numerical simulations and physical interpretations 



The qualitative distinctions between nonextensive statistics and extensive
statistics are demonstrated with the aid of the respective RD models. To this
end, a sample of 500 two-dimensional data points is drawn from three spherical
Gaussian distributions with means at (2, 3.5), (0, 0), (0, 2) (the quantized codebook).
The priors and standard deviations are 0.3, 0.4, 0.3, and 0.2, 0.5, 1.0,
respectively. The distortion measure $d(x,\tilde{x})$ is taken to be the Euclidean square
distance. The case $0 < q < 1$ (Section 5.1) is chosen for the numerical study.
The axes of the nonextensive RD curves are scaled with respect to those
of the extensive Boltzmann-Gibbs-Shannon RD curve. The nonextensive RD
numerical simulations are performed for three values of the nonextensivity
parameter: (i) q = 0.85, (ii) q = 0.70, and (iii) q = 0.5. The
nonextensive RD model demonstrates extreme sensitivity to the problem size
(model complexity) and the nature of the test data.
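The test data described above can be generated along the following lines. This is a sketch only: the random seed, the use of Python's standard `random` module, and the function names are assumptions, and the sample will not reproduce the exact data set used for Fig. 2.

```python
import random


def sample_mixture(n, means, stds, priors, seed=0):
    """Draw n two-dimensional points from a mixture of spherical Gaussians."""
    rng = random.Random(seed)
    points = []
    for _ in range(n):
        c = rng.choices(range(len(means)), weights=priors)[0]   # pick a component
        points.append(tuple(rng.gauss(m, stds[c]) for m in means[c]))
    return points


def squared_distance_matrix(points, reps):
    """d(x, x~): squared Euclidean distance from each point to each representative."""
    return [[sum((a - b) ** 2 for a, b in zip(x, r)) for r in reps] for x in points]


means = [(2.0, 3.5), (0.0, 0.0), (0.0, 2.0)]   # also used as the quantized codebook
stds = [0.2, 0.5, 1.0]
priors = [0.3, 0.4, 0.3]

points = sample_mixture(500, means, stds, priors)
d = squared_distance_matrix(points, means)
p_x = [1.0 / len(points)] * len(points)        # assumed uniform source distribution
print(len(points), len(d[0]))
```

The resulting `p_x` and `d` have the shapes expected by an alternating minimization routine that takes a source distribution and a distortion matrix as input.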

Fig. 2 depicts the extensive and nonextensive RD curves. Each curve has been
generated for values of the Boltzmann-Gibbs-Shannon inverse temperature
$\beta \in [0.1, 2.2]$. Note that in this paper, the nonextensive trade-off parameter is
$\tilde{\beta} = q\beta$. It is observed that the nonextensive RD theory exhibits a lower threshold
for the minimum achievable compression information in the distortion-compression
plane, as compared to the extensive case, for all values of q in the
range $0 < q < 1$. The nonextensive RD curves are upper bounded by the
Boltzmann-Gibbs-Shannon RD curve.

Specifically, as described in Section 5.3 of this paper, first the transition probability
$p(\tilde{x}|x)$ is obtained by solving (39), followed by a correction of the
marginal probability $p(\tilde{x})$ using (24). This process is iteratively applied for
each individual value of $\beta$ till convergence is reached. The value of $\tilde{\beta}$ is then
marginally increased, resulting in the nonextensive RD curve. This is the crux
of the nonextensive alternating minimization procedure. The nonextensive RD
curves in Fig. 2 are truncated, by terminating the nonextensive alternating
minimization algorithm when the Tsallis cut-off condition is breached.

Note that for the nonextensive cases, the slope of the tangent drawn at any
point on the nonextensive RD curve is the negative of the nonextensive trade-off
parameter, $-\tilde{\beta} = -q\beta$. Data clustering may be construed as being a form of
lossy data compression [40]. At the commencement, the nonextensive alternating
minimization algorithm solves for the compression phase with $\beta \to 0$. The
compression phase is characterized by all data points "coalescing" around a
single data point $\tilde{x} \in \tilde{X}$, in order to achieve the most compact representation
($R_q(D) \to 0$).

As $\tilde{\beta}$ increases, the data points undergo soft clustering around the cluster
centers. By definition, in soft clustering, a data point $x \in X$ is assigned to
a given cluster with center $\tilde{x} \in \tilde{X}$ through a normalized transition
probability $p(\tilde{x}|x)$. The hard clustering regime signifies regions where $\beta \to \infty$.
By definition, in hard clustering the assignment of data points to clusters is
deterministic.

An observation of particular significance is revealed in Fig. 2. Specifically, even
for less relaxed distortion constraints $\langle d(x,\tilde{x})\rangle$, any nonextensive case for
$0 < q < 1$ possesses a lower minimum compression information than the
corresponding extensive case. The threshold for the minimum achievable compression
information $I_q(X;\tilde{X})$ decreases as $q \to 0$. Note that all nonextensive
RD curves inhabit the non-achievable region for the extensive case. By definition,
the non-achievable region is the region below a given RD curve, and
signifies the domain in the distortion-compression plane where compression
does not occur.

Further, nonextensive RD models possessing a lower nonextensivity parameter
q inhabit the non-achievable regions of nonextensive RD models possessing a
higher value of q. These features imply the superiority of nonextensive models
to perform data compression vis-a-vis any comparable model derived from
Boltzmann-Gibbs-Shannon statistics.



7 Summary and conclusions 

Variational principles for a generalized RD theory, and a dual generalized RD
theory employing the additive duality of nonextensive statistics, have been
presented. This has been accomplished using a methodology to "rescue" the
linear constraints originally employed by Tsallis [1], formulated in [21]. Select
information theoretic properties of dual Tsallis uncertainties have been investigated.
Numerical simulations have demonstrated that the nonextensive RD
models exhibit a lower threshold for the compression information vis-a-vis
equivalent models derived from Boltzmann-Gibbs-Shannon statistics. This
feature acquires significance in data compression applications.

The nonextensive RD models and the nonextensive alternating minimization
numerical scheme studied in this paper represent idealized scenarios, involving
well behaved sources and distortion measures. Based on the results reported
herein, an ongoing study has treated a more realistic generalized RD scenario
by extending the works of Rose [41] and Banerjee et al. [42], and has accomplished
a three-fold objective.

First, a generalized Bregman RD (GBRD) model has been formulated using
the nonextensive alternating minimization algorithm as its basis. Let $\tilde{X}_s$ be a
subset of $\tilde{X}$, where $p(\tilde{x}) \neq 0$ (the support). From a computational viewpoint,
the GBRD model represents a non-convex optimization problem, where the
cardinality of $\tilde{X}_s$ ($|\tilde{X}_s|$) varies with increase in the nonextensive trade-off parameter.
Next, a Tsallis-Bregman lower bound for the RD function is derived.
The Tsallis-Bregman lower bound provides a principled theoretical rationale
for the lower threshold for the compression information demonstrated by generalized
statistics RD models, vis-a-vis equivalent extensive RD models.

Finally, the problem of rate distortion in lossy data compression is shown to
be equivalent to the problem of mixture model estimation in unsupervised
learning [43]. This is demonstrated for q-deformed exponential families of distributions
[44, 45]. The primary rationale for this exercise is to solve the generalized
RD problem employing an Expectation-Maximization-like algorithm
[39], using the results of Lemma 1 as a basis. Results of these studies will be
presented elsewhere.

FIGURE CAPTIONS



Fig. 1: Schematic diagram for rate distortion curve. 

Fig. 2: Rate distortion curves. Boltzmann-Gibbs-Shannon model (solid line),
generalized statistics RD model for q = 0.85 (dash-dots), q = 0.70 (dashes),
and, q = 0.5 (dots).